[WIP] Add ability to use Tensorflow to train a word2vec model #809

Closed
wants to merge 8 commits into from

droudy
Contributor

@droudy droudy commented Jul 28, 2016

The flexible architecture of TensorFlow lets you deploy computation to one or more CPUs or GPUs with a single API. The benefit of using TensorFlow to train word2vec is that it can distribute computation across GPUs. This PR lets a user easily create a word2vec model that is trained with TensorFlow, while gensim word2vec methods such as most_similar() and doesnt_match() can still be called on the trained model.

It wraps an existing TensorFlow module, tensorflow.models.embedding.word2vec_optimized.

Training data can be a gensim-style corpus or a text file.

model = TfWord2Vec(example_corpus)
model.most_similar("army")
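
The two accepted input kinds could be reduced to one before training. The sketch below is only an illustration of that idea, not the code in this PR: `corpus_to_text_file` is a hypothetical helper, and it assumes that a string argument is a path to a text file and that any other argument is an iterable of token lists (the usual gensim-style corpus).

```python
import os
import tempfile


def corpus_to_text_file(corpus):
    """Return a path to a plain-text training file for `corpus`.

    Hypothetical helper, not part of gensim or TensorFlow.
    A string is assumed to be a path to an existing text file and is
    returned unchanged. Anything else is assumed to be an iterable of
    token lists and is written out one sentence per line.
    """
    if isinstance(corpus, str):
        return corpus

    fd, path = tempfile.mkstemp(suffix=".txt")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        for sentence in corpus:
            handle.write(" ".join(sentence) + "\n")
    return path
```

A caller would be responsible for deleting the temporary file once training has finished.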

@tmylk
Contributor

tmylk commented Aug 2, 2016

Ping @gojomo

@piskvorky
Owner

piskvorky commented Aug 2, 2016

What is the status of tensorflow in gensim -- is there a notebook explaining the motivation and comparing the performance / pros / cons?

Also, what is @gojomo's role here?

from gensim import utils
from six import string_types
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
Owner

This doesn't belong in module scope -- libraries do not set up logging. That's up to applications that use them.
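
A minimal sketch of the split described in this comment, using only the standard `logging` module. The logger name and the `train` and `main` functions are hypothetical stand-ins; in a real module the logger would normally be named with `__name__`.

```python
import logging

# Library side: get a named logger and emit records, but never call
# logging.basicConfig() at import time. A NullHandler keeps the library
# silent unless the application configures logging.
logger = logging.getLogger("tfword2vec_example")  # hypothetical name
logger.addHandler(logging.NullHandler())


def train(corpus):
    """Stand-in for a library function that logs its progress."""
    logger.info("training on %d sentences", len(corpus))
    return len(corpus)


def main():
    # Application side: the only place that chooses format and level.
    logging.basicConfig(
        format="%(asctime)s : %(levelname)s : %(message)s",
        level=logging.INFO,
    )
    train([["army", "navy"]])


if __name__ == "__main__":
    main()
```

With this layout, importing the module changes no global logging state, and each application decides for itself whether and how the library's messages are shown.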

@gojomo
Collaborator

gojomo commented Aug 2, 2016

I think the idea of being able to use TF training or import vectors from a TF session is good, but this structuring seems fragile/confusing - especially mixed-overriding & renaming of parameters.

I suspect the better approach would involve some combination of: (1) a new common superclass for what is shared in implementation or interface, and having the TF implementation being a sibling class, rather than patchwork-subclass, of the traditional implementations; (2) moving the vectors-and-vocab entity out of the algorithmic entity, as proposed in #549. Of course such refactoring is a kind of big and disruptive project...
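
One way to picture the structure suggested here is the rough sketch below. Every class and function name is hypothetical, the "training" is a stand-in, and the TensorFlow side is represented by an injected callable so the example runs without TensorFlow installed. It is not the design adopted in #549 or #980.

```python
import math


def _cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class VectorStoreSketch:
    """Vectors-and-vocab entity: holds words and vectors, no training logic."""

    def __init__(self):
        self.index = {}
        self.vectors = []

    def add(self, word, vector):
        self.index[word] = len(self.vectors)
        self.vectors.append(list(vector))

    def most_similar(self, word, topn=1):
        query = self.vectors[self.index[word]]
        scored = [
            (other, _cosine(query, self.vectors[i]))
            for other, i in self.index.items()
            if other != word
        ]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


class Word2VecBaseSketch:
    """Common superclass: shared interface, delegates queries to the store."""

    def __init__(self):
        self.store = VectorStoreSketch()

    def train(self, corpus):
        raise NotImplementedError

    def most_similar(self, word, topn=1):
        return self.store.most_similar(word, topn)


class NativeTrainerSketch(Word2VecBaseSketch):
    """Sibling 1: placeholder for the traditional implementation."""

    def train(self, corpus):
        # Stand-in only: gives each distinct word a one-hot vector.
        vocab = sorted({word for sentence in corpus for word in sentence})
        for i, word in enumerate(vocab):
            self.store.add(word, [1.0 if j == i else 0.0 for j in range(len(vocab))])


class TfTrainerSketch(Word2VecBaseSketch):
    """Sibling 2: imports vectors produced by an external training run."""

    def __init__(self, run_training):
        super().__init__()
        # run_training(corpus) -> (words, matrix); stands in for a TF session.
        self._run_training = run_training

    def train(self, corpus):
        words, matrix = self._run_training(corpus)
        for word, vector in zip(words, matrix):
            self.store.add(word, vector)
```

Because both trainers are siblings under one base class and only fill the shared store, neither needs to override or rename the other's parameters.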

@tmylk tmylk added the "feature" (Issue described a new feature) and "difficulty medium" (Medium issue: required good gensim understanding & python skills) labels Oct 4, 2016
@tmylk
Contributor

tmylk commented Oct 4, 2016

Current status: Blocked by #549

@tmylk
Contributor

tmylk commented Nov 13, 2016

@anmol01gulati This PR can now be updated to use KeyedVecs from #980

@markroxor
Contributor

Extending this PR here #1033

@tmylk tmylk closed this Dec 31, 2016
5 participants